1.0 DATA BASES USED


The  first  phase  of  this  study  involved  generating  an  algorithm  for  giving 
a probability  that  two  oil  samples  (the  suspect  sample  and  the  spill  sample) 
are  the  same.  This  was  done  using  fluorescence  spectra  of  some  230  oil  samples 
furnished  by  the  Coast  Guard  Research  and  Development  Center.  These  include 
crude  oils,  heating  oils  and  lubricating  oils. 

The second phase of this study investigated the independence of infrared and
fluorescence spectra of oils. For this phase we had infrared and fluorescence
spectra of 30 oil samples. The fluorescence spectra were furnished by the
Coast Guard R&D Center and the infrared spectra were furnished by J. Mattson.
These oils are listed in Table 1.

2.0  DIGITIZING  DATA 

A spectrum  S is  the  graph  of  a function  defined  on  some  interval  on  the 
real  line.  To  do  computations  we  need  to  approximate  the  spectrum  S by  a vector 
in some Euclidean space R^n; i.e., with an ordered n-tuple X of real numbers (1).
This  should  be  done  in  a manner  dictated  by  the  computations  to  be  made.  In 
general  we  would  like  to  make  n as  small  as  we  can  without  losing  so  much 
information  that  our  computational  goal  cannot  be  achieved.  Thus  we  want  the 
numbers  going  into  our  n-tuple  X to  contain  the  kind  of  information  we  seek. 

Here  is  a specific  example.  Fluorescence  spectra  of  oils  decay 
exponentially  after  some  wavelength  (e.g.,  about  380  nm  for  crudes).  Now  an 
exponential  curve  is  completely  determined  by  knowing  two  points  on  it.  So  if 
we  read  amplitudes  at  more  than  two  points  in  the  exponential  tail-off  region, 
we  are  increasing  n without  getting  any  more  information  for  identification 
purposes. 
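
As a rough illustration of the two-point argument, the sketch below recovers a
decay curve from two amplitude readings and evaluates it at an intermediate
wavelength. The functional form A * exp(-k * x), the function names, and the
sample readings are assumptions made for illustration; they are not taken from
the study's program.

```python
import math

def exponential_through(p1, p2):
    """Return (A, k) such that y = A * exp(-k * x) passes through p1 and p2.

    Each point is a (wavelength, amplitude) pair with positive amplitude.
    """
    (x1, y1), (x2, y2) = p1, p2
    k = math.log(y1 / y2) / (x2 - x1)   # decay rate from the amplitude ratio
    A = y1 * math.exp(k * x1)           # scale so the curve hits (x1, y1)
    return A, k

def tail_amplitude(A, k, x):
    """Amplitude predicted by the fitted tail at wavelength x."""
    return A * math.exp(-k * x)

# Hypothetical readings at 400 nm and 450 nm (illustrative values only).
A, k = exponential_through((400.0, 50.0), (450.0, 20.0))
midpoint = tail_amplitude(A, k, 425.0)
```

Any further reading in the tail, such as the one at 425 nm here, is already
implied by the first two, which is why it adds nothing for identification.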

Similarly, if we know that between a peak and an adjoining valley the
spectrum is essentially linear, we get no information by reading an amplitude
in this linear region.

Now the ordered n-tuple X of real numbers does not have to be simply the
amplitudes of S read at certain wavelengths. Any real number assignment which
gives good differences for some different samples is a candidate for a component
of X (2). For example, one may assign a real number describing sharpness of a
peak or valley. But, of course, the i-th coordinate of X must be measuring the
same thing for each spectrum.

TABLE 1 (oils used in the second phase)

For fluorescence spectra of oils we have the exponential tail-off and we
have essential linearity between a peak and an adjacent valley. However,
amplitude readings at all peaks and valleys are insufficient, for we would have
different spectra S_1 ≠ S_2 whose corresponding vectors X_1, X_2 are
almost identical. Thus components for sharpness must be added.

3.0  CLUSTERING 

In the 230 fluorescence spectra, assigning amplitudes for all valleys and
peaks and assigning sharpness factors for all valleys and peaks gives for each
spectrum S a vector X in 68-dimensional space, X ∈ R^68. Since probabilistic
calculations involve measuring a large number of distances between vectors,
they would be beyond the capabilities of most computers with 230 vectors in R^68.

However, examination of these spectra shows that they naturally fall into
33 groups or clusters, the intercluster distances being great compared with
intracluster distances. (For a review of cluster analysis, see reference (3).)
Now, within a given cluster, some peaks and valleys are completely absent and
readings taken there would be wasted. Thus, for a given cluster, we can greatly
reduce the dimension of the vector X assigned to a spectrum. Of course we
still need sharpness factors, because the problem of S_1 ≠ S_2 but X_1 = X_2 (when
not using sharpness) quite naturally occurs for S_1 and S_2 in the same cluster.
The dimension of vectors X which give good separation within a cluster was about
12 to 14.

4.0  ASSUMPTIONS  FOR  PROBABILITY  COMPUTATION 

1. Reproducibility of spectra is sufficiently accurate that a spectrum
may be considered a point in R^n.

2.  The  library  is  sufficiently  representative  so  that  for  any  given  oil 
we  can  unambiguously  say  that  it  is  "essentially  like"  exactly  one  in  the 
library. 
3. The clusters have been chosen so that there is essentially zero
probability  of  any  two  oils  being  the  same  if  they  fall  into  different  clusters. 
Also  we  assume  that  assigning  any  oil  to  the  cluster  with  the  nearest  centroid 
gives  the  right  assignment. 
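
The nearest-centroid rule in assumption 3 can be sketched as follows. This is a
minimal illustration, assuming Euclidean distance and clusters held as plain
lists of vectors; the function names and the toy data are hypothetical and do
not come from the original program.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def nearest_cluster(x, clusters):
    """Index of the cluster whose centroid is closest to x (Euclidean)."""
    centroids = [centroid(c) for c in clusters]
    return min(range(len(centroids)),
               key=lambda i: math.dist(x, centroids[i]))

# Two toy clusters in the plane (illustrative values only).
clusters = [
    [[0.0, 0.0], [0.0, 2.0]],      # centroid (0, 1)
    [[10.0, 10.0], [12.0, 10.0]],  # centroid (11, 10)
]
spill_cluster = nearest_cluster([1.0, 1.0], clusters)
suspect_cluster = nearest_cluster([9.0, 9.0], clusters)
```

Under assumption 3, two samples assigned to different clusters in this way are
treated as having essentially zero probability of being the same oil.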

With these assumptions the algorithm given in the next section gives a
combined probability satisfying the standard conditions required of a probability
function. These standard conditions are listed in any text on probability
theory such as references (4) or (5).

5.0  THE  ALGORITHM 

The  probability  that  two  unknowns  X and  Y (spill  and  suspect)  are  the 
same  is  obtained  by  combining  two  probability  factors.  (This  is  all  for 
fluorescence  — not  to  be  confused  with  our  later  combinations  of  probabilities 
for  fluorescence  and  infrared.) 

Let N be the number of spectra in the library (about 230). Let C_1, C_2, ...,
C_33 denote the number of elements in the clusters (so that C_1 + C_2 + ... + C_33 =
N). Let the i-th cluster be the cluster containing X. If Y is also in the i-th
cluster we set

    P_1 = \frac{N - C_i}{N}

and if Y is not in the i-th cluster we use zero. (This slight lack of symmetry
in treatment of X and Y is unimportant, for if X and Y are in different clusters,
the probability of identity is quite low.)

If X and Y are not in the same cluster, we say the total probability of
identity is 0. If X and Y are in the same cluster (say the i-th) we calculate a
second probability factor P_2 as follows: We now represent all vectors in the
i-th cluster as vectors in the space of dimension 12 to 14 used for that cluster.
Let X' and Y' now denote the spill and suspect in this space. Let n_x be the
number of vectors in the i-th cluster within a distance d(X',Y') of X'. Define
n_y similarly. Set


    P_2 = \frac{n_x + n_y}{4}


The final probability then that X = Y is given by:

    P = 1 - (1 - P_1)(1 - P_2)
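
The pieces of this computation that the text states explicitly can be sketched
as below: the first factor for two samples in the same cluster, the neighbour
counts around the spill and the suspect, and the final combination. This is an
illustrative sketch only. The function names and toy vectors are hypothetical,
counting with "<=" rather than "<" is an assumption, and the second factor is
passed in as a number rather than computed.

```python
import math

def first_factor(n_library, cluster_size):
    """First factor, (N - C_i) / N, for two samples in the same cluster i."""
    return (n_library - cluster_size) / n_library

def count_within(center, radius, cluster_vectors):
    """Number of library vectors in the cluster within `radius` of `center`.

    Including vectors at exactly `radius` is an assumption.
    """
    return sum(1 for v in cluster_vectors if math.dist(center, v) <= radius)

def combine(p1, p2):
    """Final probability 1 - (1 - p1)(1 - p2) from the two factors."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)

# Hypothetical reduced-space library vectors for one cluster.
cluster = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
spill, suspect = (0.0, 0.0), (2.0, 0.0)

radius = math.dist(spill, suspect)            # d(X', Y')
n_x = count_within(spill, radius, cluster)    # neighbours of the spill
n_y = count_within(suspect, radius, cluster)  # neighbours of the suspect
```

The combination rule only raises the result above the larger of the two
factors, so a first factor close to 1 dominates the final value.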
6.0 THE COMPUTER PROGRAM

The computer program on punched cards has been sent to the Coast Guard R&D
Center in Groton. It suffices to say here that the program will: (1) compute
the sharpness factors needed to augment the amplitude readings, (2) do linear
interpolation and exponential interpolation to generate the 68-dimensional
vectors for the entire library, (3) decide (using these 68-dimensional vectors)
to which clusters X and Y belong, and (4) carry out the probability computation
as described above in the algorithm.

On several unknowns this program proved completely successful. But for
some different library of spectra, it should not be used without some
refinements which were not needed for our library of fluorescence spectra.

7.0  INDEPENDENCE  OF  DATA 

The purpose of this study was very limited and specific. Our objective
was to investigate the degree of independence or, conversely, dependence between
data obtained from the infrared and fluorescence spectra of oils. Other work
has been done on this problem by Killeen and Chien (6).

The motivation behind such a study is quite simple. Various procedures
exist for determining the probability of a match, P_1, between a spill sample
and a source sample using data obtained from the fluorescence spectra of oils
(7). In addition, procedures have been developed for determining the probability
of a match, P_2, using data obtained from the infrared spectra of oils (8). What
is needed is a method for combining the individual probabilities of a match into
an overall probability of a match, P. Obviously, such a probability would be
a more accurate and reliable measure of the probability of a match than either
of the individual probabilities, for it incorporates more information about the
problem at hand.

In order to develop the means for combining the individual probabilities,
an investigation of the independence of the two types of data, that obtained
from the infrared spectra and that obtained from the fluorescence spectra, must
be made. This paper describes the work which has been conducted at Rice
University to accomplish this end.


7.1 Data Base Used

The data base used in this investigation came from 30 different
types of oils and was supplied by the U. S. Coast Guard R&D Center in Groton,
Connecticut. These oils are listed in Table 1. The oil types broke down as
follows: six No. 2 fuel oils, two No. 4 fuel oils, four No. 5 fuel oils, six
No. 6 fuel oils, two No. 2 diesel oils, nine diesel oils (unspecified), and one
crude oil from Venezuela.

The infrared spectra were obtained using a Perkin-Elmer Model 180
infrared spectrophotometer (with cells of pathlengths 0.12 ± 0.03 mm) which used
electronic ratioing (rather than the usual optical null). The spectrophotometer
had an ordinate resolution (in %T) of one part in 12,000 and an abscissa
resolution (in wavenumbers) of 0.01 cm^-1. It was interfaced with a Data General
NOVA 1200 minicomputer.

The fluorescence spectra were obtained from oil samples diluted to
100 ppm in cyclohexane (wt/wt) run on an uncorrected Perkin-Elmer MPF-3
spectrofluorometer with excitation at 254 nm and bandwidths: excitation, 34 nm
and emission, 1.5 nm.

For each of the oils in the data base, samples of the amplitudes of
the infrared spectra at various frequencies were determined. In addition,
measurements of the amplitudes of the fluorescence spectra at selected
frequencies were supplied. The frequencies used for the infrared study are given
in Table 2.

By digitizing or sampling each spectrum at a number of frequencies,
one can get a series of real numbers representative of each spectrum. In the
present study, 20 measurements of the infrared and 20 measurements of the
fluorescence spectrum of each oil were made. These measurements can be listed
in vector form. We shall let the vectors V_i = (v_{i1}, v_{i2}, ..., v_{i20}) and
W_i = (w_{i1}, w_{i2}, ..., w_{i20}) represent the infrared and fluorescence
spectra, respectively, of the i-th oil of the data base. In the following
section, the experiments conducted to determine the degree of independence
between the two types of data are discussed.
TABLE 2


FREQUENCIES  AT  WHICH  SAMPLES  WERE  TAKEN 


7.2  Experiments  Performed  to  Determine  the  Degree  of  Independence 


Since both types of spectra arise from the basic physical
properties of the oil, we expect there to be some degree of dependence.
However, we also expect that since the infrared and fluorescence spectra are
actually measurements taken in different domains of the physical spectrum,
there should be some independence between these two types of spectra. To begin
with, the Euclidean distance between each pair of infrared vectors was
calculated. We represent the distance between vectors V_i and V_j as d_{ij}.
Similarly, the pairwise distances between vectors in the fluorescence domain
were calculated and are represented by d^*_{ij}. The distances are important in
the sense that in estimating the probability density structure of the data, the
distance between vectors representative of the underlying random process is
important (9). In addition, most of the algorithms which exist for determining a
match between a spill and a source are based on minimum distance or nearest
neighbor arguments.
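
The pairwise distance calculation can be sketched as follows. This is a minimal
illustration with a hypothetical function name and made-up vectors; for the 30
oils of the data base it would yield 30 * 29 / 2 = 435 distances per domain.

```python
import math
from itertools import combinations

def pairwise_distances(vectors):
    """Map each index pair (i, j) with i < j to the Euclidean distance
    between vectors[i] and vectors[j]."""
    return {(i, j): math.dist(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)}

# Three made-up 2-component vectors standing in for digitized spectra.
toy_vectors = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
toy_distances = pairwise_distances(toy_vectors)

# Number of distinct pairs for a 30-oil data base of 20-component vectors.
n_pairs_30 = len(pairwise_distances([(float(k),) * 20 for k in range(30)]))
```

Applying the same function once to the infrared vectors and once to the
fluorescence vectors gives two distance sets indexed by the same oil pairs.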

As a first step of analysis, the histogram of the distances for each
type of spectra was prepared. These histograms are shown in Figures 1 and 2.
Comparison of the two histograms is interesting. The infrared histogram peaks
for a value of distance of 0.2 and then decays in a generally exponential
fashion. In contrast, the fluorescence histogram consists of a number of local
minima and maxima. It also contains more of its area for higher values of
distance than the infrared histogram. Due to the discrepancy in the shape of
these two histograms, one is led to believe there is some degree of independence
between the two types of data considered.

In order to gain further insight into the structure of the data,
a scatter diagram of the distance data was prepared and is shown in Figure 3.
This diagram plots the value of distance in the fluorescence domain vertically
versus distance in the infrared domain horizontally. If a true linear
relationship existed between the values of d_{ij} and d^*_{ij}, we would see a
bunching of the scatter diagram points along the regression line of d^*_{ij} on
d_{ij}. If a simple nonlinear relationship existed, we would expect to see this
displayed as bunching along a curve on the scatter diagram. However, we see
neither of the above in the figure. Instead, we see a general smearing of the
data points with some bunching of points near the lower end of each axis. This
general type of structure indicates that there is a good deal of independence
between the distances calculated.

FIGURE 1. HISTOGRAM OF DISTANCES IN INFRARED DOMAIN

FIGURE 2. HISTOGRAM OF DISTANCES IN FLUORESCENCE DOMAIN
(horizontal axis: Distance Between Vectors in Fluorescence Domain)

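The correlation coefficient between the two types of distances is
defined as:

    \rho = \frac{\sum_{i=1}^{29} \sum_{j=i+1}^{30} (d_{ij} - \bar{d})(d^*_{ij} - \bar{d}^*)}{(\mathrm{var}\, d)^{1/2} (\mathrm{var}\, d^*)^{1/2}}

where \bar{d} is the mean of the infrared distances and
\bar{d}^* is the mean of the fluorescence distances.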

For values of |\rho| close to 1, there is a strong linear relationship
between the infrared and fluorescence distances. For |\rho| small, there is not a
strong linear dependence present. For the data considered, \rho was found to be
0.47, which indicates that there is not a strong linear relationship.
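
A correlation coefficient of this kind can be computed as in the sketch below.
It uses the standard Pearson form, with the covariance and both variances
normalized by the number of pairs; that normalization and the function name are
assumptions, and the input values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Standard Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)

# Made-up distance lists: an exact linear relation and its reversal.
infrared = [1.0, 2.0, 3.0, 4.0]
rho_linear = pearson(infrared, [2.0, 4.0, 6.0, 8.0])
rho_reversed = pearson(infrared, [8.0, 6.0, 4.0, 2.0])
```

In the study the two lists would be the infrared and fluorescence distances
taken over the same oil pairs, in the same order.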

The ratio of the two distances (infrared and fluorescence) calculated for
each pair of oils is represented by r_{ij} (i = 1, 2, ..., 29; j = i+1, ..., 30).
The values of r_{ij} were histogrammed and the resulting plot is given in
Figure 4. If there were a high degree of dependence between the two variables,
we would expect to see a high degree of structure in this histogram. For
instance, if there were a highly linear relationship, we would expect to see a
high narrow spike in the histogram of the distance ratio. In contrast, if the
data were completely independent, we would expect a high degree of uniformity
in the amplitude of the histogram. By examining the histogram of Figure 4, we
see a good degree of uniformity for values of the ratio less than 2. We then see
a high degree of uniformity, but at a lower level, for values of the ratio
greater than 2 but less than 7.5. This indicates that these types of data are
fairly independent.
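
A histogram of distance ratios can be built as in the sketch below. The
direction of the ratio (fluorescence distance over infrared distance), the bin
width of 0.5, the upper limit of 7.5, and the function name are all assumptions
chosen for illustration; the input values are made up.

```python
def ratio_histogram(d_infrared, d_fluorescence, bin_width=0.5, max_ratio=7.5):
    """Count ratios d_fluorescence / d_infrared in equal-width bins.

    Pairs with a zero infrared distance are skipped, and ratios at or
    above max_ratio are ignored.
    """
    n_bins = int(max_ratio / bin_width)
    counts = [0] * n_bins
    for d_ir, d_fl in zip(d_infrared, d_fluorescence):
        if d_ir == 0:
            continue                      # ratio undefined
        k = int((d_fl / d_ir) / bin_width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Made-up distances; the ratios are 0.2, 1.2 and 1.0, and one pair is skipped.
counts = ratio_histogram([1.0, 1.0, 2.0, 0.0], [0.2, 1.2, 2.0, 5.0])
```

A strictly proportional pair of distance sets would put every count in a single
bin, which is the narrow spike described above.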

Further information can be obtained by dividing the oils of the data
base into six categories: No. 2 fuel oils, No. 4 fuel oils, No. 5 fuel oils,
No. 6 fuel oils, crude oils, and diesel oils. The mean of each of these classes
was estimated using the fluorescence and the infrared data. The distance
between the means for each pair of classes was calculated for each type of data.
The results are presented in Figure 5. It is readily seen that the classes with
means relatively close under fluorescence data are not necessarily close under
infrared and vice versa. This indicates a fairly high degree of data
independence.

FIGURE 5. DISTANCES BETWEEN MEANS FOR VARIOUS OIL CLASSES
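
The class-mean comparison can be sketched as follows: group the vectors by oil
class, average each group, and take the Euclidean distance between every pair of
class means. The function name, class labels, and vectors are hypothetical and
serve only to illustrate the calculation.

```python
import math
from collections import defaultdict
from itertools import combinations

def class_mean_distances(vectors, labels):
    """Distance between the mean vectors of every pair of classes."""
    groups = defaultdict(list)
    for vector, label in zip(vectors, labels):
        groups[label].append(vector)
    means = {label: [sum(col) / len(members) for col in zip(*members)]
             for label, members in groups.items()}
    return {(a, b): math.dist(means[a], means[b])
            for a, b in combinations(sorted(means), 2)}

# Made-up 2-component vectors for two classes.
toy = class_mean_distances(
    [(0.0, 0.0), (2.0, 0.0), (1.0, 4.0)],
    ["No. 2 fuel", "No. 2 fuel", "diesel"],
)
```

Running this once on the infrared vectors and once on the fluorescence vectors
gives the two sets of class distances that are compared.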

Based on these data we feel that the formula

    P = 1 - 2(1 - P_1)(1 - P_2)

for combining P_1 and P_2 is quite conservative; i.e., it will not give too high
a value to P. However, it will require much more computation before we would
feel justified in lowering the "2" in the formula.
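
A short worked example of the combining formula is sketched below. The function
name, the `factor` parameter, and the input probabilities are illustrative
assumptions; setting the factor to 1 gives the form used earlier for the two
fluorescence factors, shown here only for comparison.

```python
def combined_match_probability(p_fluorescence, p_infrared, factor=2.0):
    """Overall match probability 1 - factor * (1 - P_1) * (1 - P_2)."""
    return 1.0 - factor * (1.0 - p_fluorescence) * (1.0 - p_infrared)

# Made-up individual probabilities of a match.
conservative = combined_match_probability(0.9, 0.8)           # factor of 2
unweighted = combined_match_probability(0.9, 0.8, factor=1.0)
```

With these inputs the factor of 2 gives about 0.96 against about 0.98 without
it, which shows the sense in which the formula is conservative.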